More Help Requested (March 4, 2004)
Posted 1:02 p.m.,
March 22, 2004
(#34) -
Wally Moon
I'm coming in a bit late on this discussion, but I wholeheartedly endorse the advice you're getting from Alan Jordan in particular but also several others.
I would throw out only ballots that show completely unambiguously that the respondent did not attempt to perform the task of rating. Thus, the cases where a player gets identical high, or medium, or low marks in all categories are potential (but not automatic) candidates for this. Such cases are akin to people doing a "sink test" in a medical lab -- throw the sample (observational data) down the drain and write up the analysis based on other criteria. However, you should be very conservative in excluding cases.
I would not throw out any other cases, but preserve them in the larger data set. Then, later on, in your choice of measures of central tendency (e.g., medians vs. means, "trimmed" means leaving off cases with scores > 2 S.D., or other alternatives), you should conduct sensitivity tests or tests of robustness to see whether your larger analysis is affected by the choice of measures or cases.
If you can include all except the most egregious "sink test" cases and still get nearly identical overall statistical results, then you will avoid the perception or accusation that you've cooked your results through selective use of the data.
This is akin to running your analysis under a variety of alternative assumptions about missing data, or about the quality of data, and so on. You also want to avoid, in principle, throwing away variance by assuming that the central tendency based on "most" responses is "right" and the extreme cases are "wrong." In the end, that may actually produce effects in your analysis that are the opposite of what you might suppose. For example, it could lead to an attenuation of correlations across indicators or measures of performance rather than to an improvement of them, because everyone then starts to become "average" or "modal" in their observed behavior across multiple indicators.
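To make that attenuation point concrete, here is a small simulation (all numbers invented, nothing to do with the actual survey): two noisy indicators of the same underlying quality are generated, then the "extreme" respondents on one indicator are overwritten with its mean, and the observed correlation with the other indicator typically shrinks a bit rather than improving.

    # toy simulation of the attenuation argument: "purifying" one indicator by
    # pulling its extreme cases to the mean reduces its correlation with a second
    # indicator of the same underlying quality (a range-restriction effect)
    import numpy as np

    rng = np.random.default_rng(42)
    true_quality = rng.normal(0, 1, 1000)
    indicator_a = true_quality + rng.normal(0, 0.5, 1000)
    indicator_b = true_quality + rng.normal(0, 0.5, 1000)

    raw_r = np.corrcoef(indicator_a, indicator_b)[0, 1]

    # replace cases more than 2 s.d. from the mean of A with the mean of A
    purified_a = indicator_a.copy()
    extreme = np.abs(indicator_a - indicator_a.mean()) > 2 * indicator_a.std()
    purified_a[extreme] = indicator_a.mean()
    purified_r = np.corrcoef(purified_a, indicator_b)[0, 1]

    print(f"correlation, all cases kept as observed: {raw_r:.3f}")
    print(f"correlation, extreme cases pulled to the mean: {purified_r:.3f}")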
So (1) keep all the data (eliminating only unequivocally sink-test cases); (2) use a variety of measures of central tendency; (3) conduct tests of sensitivity or robustness to see how much the inclusion of deviant or extreme cases affects the data analysis; (4) err toward inclusion of seemingly weird cases rather than exclusion and purification, especially since (as is likely) inclusion of these cases won't make much difference anyway; (5) report the results under different assumptions.
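A minimal sketch of steps (2), (3), and (5) in Python, using made-up ballot scores and a hypothetical flag for the sink-test ballots (all variable names are invented):

    # compare several measures of central tendency under two inclusion rules and
    # report them side by side, so readers can see whether the exclusions matter
    import numpy as np
    from scipy import stats

    ratings = np.array([4, 5, 3, 4, 5, 4, 2, 5, 4, 3, 5, 1, 1, 1])  # hypothetical scores
    sink_test = np.array([False] * 11 + [True] * 3)                 # flagged ballots

    def summarize(x):
        return {
            "mean": round(float(np.mean(x)), 2),
            "median": float(np.median(x)),
            # rough analogue of the "trimmed" option: drop the top and bottom 10% of cases
            "trimmed mean": round(float(stats.trim_mean(x, proportiontocut=0.10)), 2),
        }

    for label, data in [("all ballots", ratings), ("sink-test ballots removed", ratings[~sink_test])]:
        print(label, summarize(data))

If the two rows come out nearly identical, that is itself the robustness result worth reporting.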
Silver: The Science of Forecasting (March 12, 2004)
Posted 10:43 a.m.,
March 13, 2004
(#15) -
Wally Moon
#11: Do you make similarity the "focal point" of a method, or do you perhaps just factor it in to a Marcel-type system, with a certain "weight" given?
Best I can figure out, it is how the similarity scores are determined that is the main claim to innovation in PECOTA -- where the "secret formula" is. About all we know about the specific method used is in that BP2003 chapter by Silver, with the comparisons between PECOTA and Bill James's method. It is this part of the method that is the "art" or invention of PECOTA and the part that Silver is least willing to share. If anybody else knew how he selected the comparables -- the particular weights given to the 10 factors or variables -- and how he used the similarity scores to make the projections then they'd be able to replicate his predictions.
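Just to make the idea concrete (this is a hypothetical toy, not Silver's actual method; the factor names and weights below are invented), a weighted similarity score and nearest-comparable selection could be sketched like this:

    # toy similarity score: weighted absolute distance over a few factors, converted
    # to a higher-is-more-similar scale; nothing here reflects the real PECOTA
    # weights, which are not public
    import numpy as np

    FACTORS = ["age", "bb_rate", "k_rate", "iso_power", "speed_score"]
    WEIGHTS = np.array([3.0, 1.5, 1.5, 2.0, 1.0])   # invented weights

    def similarity(target, candidate):
        diffs = np.abs(np.array([target[f] - candidate[f] for f in FACTORS]))
        return 100.0 - float(WEIGHTS @ diffs)

    target = {"age": 27, "bb_rate": 9.5, "k_rate": 18.0, "iso_power": 0.180, "speed_score": 5.0}
    pool = {
        "Player B": {"age": 27, "bb_rate": 10.0, "k_rate": 17.5, "iso_power": 0.170, "speed_score": 4.5},
        "Player C": {"age": 31, "bb_rate": 6.0, "k_rate": 22.0, "iso_power": 0.210, "speed_score": 3.0},
    }
    comparables = sorted(pool.items(), key=lambda kv: similarity(target, kv[1]), reverse=True)
    print("closest comparable in this toy example:", comparables[0][0])

The projections themselves would then presumably be built from how those comparables actually aged, weighted by similarity, which is exactly the part we can't see from the outside.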
The "Science of Forecasting" essay uses 4 archtypes as illustrations. Do we know how many are used in PECOTA? I have a hunch he's got a bunch of them, perhaps a series of gradations on multiple dimensions, rather than any fixed number of categories. The 4 types given in this article are just for illustrative purposes, not to say how PECOTA actually operates.
On the number of comparables, I'd be interested in how reliable and accurate the performance predictions are as a function of the number of comparables. Is PECOTA less accurate when there are fewer comparables or when the similarity scores are lower? It seems likely, but it would be interesting to know.
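The check itself would be straightforward for anyone with the matched forecasts. A rough sketch, using toy stand-in data and a hypothetical n_comparables column rather than the real PECOTA output: bucket players by how many comparables they have and compare mean absolute forecast error across the buckets.

    # toy data built so that error shrinks with more comparables, just to show the
    # shape of the check; a real answer requires PECOTA's actual forecasts
    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(0)
    n_comparables = rng.integers(5, 200, size=500)
    abs_error = np.abs(rng.normal(0, 0.050 / np.sqrt(n_comparables)))
    df = pd.DataFrame({"n_comparables": n_comparables, "abs_error": abs_error})

    df["bucket"] = pd.cut(df["n_comparables"], bins=[0, 20, 50, 100, 200])
    print(df.groupby("bucket", observed=True)["abs_error"].agg(["mean", "count"]))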
I asserted before that nobody should expect Silver to divulge his "formula." (And the core of that formula is in the determination of comparables.) The proof of the system is in the pudding, not the recipe. If anybody can come up with better predictions of performance, they should do so.
Silver is right that you don't see confidence levels (whatever process may be generating them -- simply N-size, or something more) reported in most baseball forecasting systems. But you'll notice that when he does his comparisons of PECOTA with the competition, he still has to revert to his weighted means to say anything useful about PECOTA's predictive validity vs. the competition.
All this said, I thought this was a well done article, one of the best in the "BP Basics" series.
Silver: The Science of Forecasting (March 12, 2004)
Posted 10:46 p.m.,
March 13, 2004
(#21) -
Wally Moon
"Take the most extreme cases. Player A (a pitcher) has no history so we assign the league mean"
But of course that's not what PECOTA would do. It would use translated minor league (or international) experience to choose comparables instead of assigning the league rookie mean and variance.
Silver: The Science of Forecasting (March 12, 2004)
Posted 12:53 p.m.,
March 15, 2004
(#26) -
Wally Moon
I think for this kind of analysis you somehow have to address the selection bias and simultaneity problem. If a pitcher starts out substantially underperforming his "true" ERA, or perhaps his "expected" ERA based on Marcel, then he's likely to get less playing time. If he outperforms his true ERA he's going to get more playing time. (The same argument would apply to position players.)
A player on a downward spiral within a season and across his career gets less and less playing time, and then unless he gets on a lucky streak he may find that there is no way to get better because he's sitting on the bench. (So he may be traded to a place where his value to the other team exceeds his value to his current team, or he may be demoted to AAA to get more playing time, etc.) And of course, in some cases sharply reduced playing time is associated with injury of some kind, which is just an extreme case of deteriorating productivity (or of unavailability to produce, at least in the short run but possibly for a long time or forever, with a career-ending injury). But in a more typical case the deterioration could just be the result of "aging" or other factors, not least of which might be who else is on the team -- a player performing well below expectations can usually be "replaced," and of course eventually all players are replaced and their production is reduced to zero.
In any case, variance in performance has a reciprocal relationship with playing time (as well as, I would speculate, with under or overperforming "expectations," i.e., not only with the actual level of performance). I'm not quite sure what the implications are for modeling or forecasting player performance.
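A toy simulation of that reciprocal relationship (all parameters invented): if second-half innings are allotted partly on first-half results, then first-half luck gets built into playing time, which is the simultaneity problem in miniature.

    # pitchers' second-half innings respond to observed first-half ERA, not to true
    # talent, so "unlucky" pitchers get fewer chances to regress back toward their
    # true level; parameters are arbitrary
    import numpy as np

    rng = np.random.default_rng(7)
    true_era = rng.normal(4.50, 0.50, 1000)
    first_half_era = true_era + rng.normal(0, 0.75, 1000)      # noisy observed ERA

    second_half_ip = np.clip(90 - 20 * (first_half_era - 4.50), 10, 120)

    bad_luck = first_half_era - true_era                        # positive = worse than true
    r = np.corrcoef(bad_luck, second_half_ip)[0, 1]
    print(f"corr(first-half bad luck, second-half innings): {r:.2f}")   # strongly negative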
Silver: The Science of Forecasting (March 12, 2004)
Posted 6:23 p.m.,
March 15, 2004
(#37) -
Wally Moon
If you regress PECOTA against Marcel (with Marcel as the indep. or predictor variable), then you can generate residuals and look for cases where PECOTA is +/- 1 std. dev. from the Marcel prediction. You can then, of course, add "park" or "team" to the regression equation to see how much that accounts for the remaining variance. With OBP, you are already accounting for 79.7% of the variance in PECOTA using Marcel (the square of the correlation coefficient), and you are already accounting for 77.8% of the variance in SLG. Thus, you have about 20-22% "unexplained" variance in the predictions -- so how much of that error is reduced when you add "team" or "park," or, for that matter, Tango's suggestion here: PA?
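For what it's worth, here is a sketch of that regression in Python with statsmodels, using synthetic stand-in data and hypothetical column names (marcel_obp, pecota_obp, team, pa) rather than the actual projections:

    # regress PECOTA on Marcel, flag projections more than 1 s.d. off the Marcel line,
    # then add team dummies and PA to see how much residual variance they absorb;
    # the data below are synthetic and only loosely in the ballpark of the OBP figures
    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(1)
    n = 300
    marcel_obp = rng.normal(0.340, 0.025, n)
    pecota_obp = 0.020 + 0.950 * marcel_obp + rng.normal(0, 0.011, n)
    df = pd.DataFrame({
        "marcel_obp": marcel_obp,
        "pecota_obp": pecota_obp,
        "team": rng.choice(["NYA", "BOS", "OAK", "SEA"], n),
        "pa": rng.integers(200, 700, n),
    })

    base = smf.ols("pecota_obp ~ marcel_obp", data=df).fit()
    print("R^2, Marcel only:", round(base.rsquared, 3))
    disagreements = df[base.resid.abs() > base.resid.std()]
    print(len(disagreements), "players where PECOTA sits more than 1 s.d. off the Marcel line")

    extended = smf.ols("pecota_obp ~ marcel_obp + C(team) + pa", data=df).fit()
    print("R^2 adding team dummies and PA:", round(extended.rsquared, 3))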
Thanks for considering this added wrinkle.
Silver: The Science of Forecasting (March 12, 2004)
Posted 6:25 p.m.,
March 15, 2004
(#38) -
Wally Moon
I meant when adding "park" or "team" to add dummy variables for the teams (all except 1, of course, as the reference category).
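A tiny illustration of that coding convention, with made-up team labels: k teams become k-1 indicator columns, and the dropped team serves as the reference category.

    # k-1 dummy columns for k teams; the omitted team is the reference category
    import pandas as pd

    teams = pd.Series(["NYA", "BOS", "OAK", "NYA", "SEA"], name="team")
    print(pd.get_dummies(teams, prefix="team", drop_first=True))

(In a statsmodels/patsy formula, writing C(team) applies this treatment coding automatically, dropping one level as the reference.)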